Parallel Checkpoint/Recovery on Cluster of IA-64 Computers
نویسندگان
چکیده
We design and implement a high availability parallel run-time system---ChaRM64, a Checkpointbased Rollback Recovery and Migration system for parallel running programs on a cluster of IA-64 computers. At first, we discuss our solution of a user-level, single process checkpoint/recovery library running on IA-64 systems. Based on this library, ChaRM64 is realized, which implements a user-transparent, coordinated checkpointing and rollback recovery (CRR) mechanism, quasi-asynchronous migration and the dynamic reconfiguration function. Owing to the above techniques and efficient error detection, ChaRM64 can handle cluster node crashes and hardware transient faults in a IA-64 cluster. Now ChaRM64 for PVM has been implemented in Linux and the MPI version is under construction. As we know, there are few similar projects accomplished for IA-64 architecture.
منابع مشابه
Parallel Spatial Pyramid Match Kernel Algorithm for Object Recognition using a Cluster of Computers
This paper parallelizes the spatial pyramid match kernel (SPK) implementation. SPK is one of the most usable kernel methods, along with support vector machine classifier, with high accuracy in object recognition. MATLAB parallel computing toolbox has been used to parallelize SPK. In this implementation, MATLAB Message Passing Interface (MPI) functions and features included in the toolbox help u...
متن کاملParallelKnoppix - Rapid Creation of a Linux Cluster for MPI Parallel Processing Using Non-Dedicated Computers
This note describes ParallelKnoppix, a bootable CD that allows creation of a Linux cluster in very little time. An experienced user can create a cluster ready to execute MPI programs in less than 10 minutes. The computers used may be heterogeneous machines, of the IA-32 architecture. When the cluster is shut down, all machines except one are in their original state, and the last can be returned...
متن کاملFail-safe PVM: A portable package for distributed programming with transparent recovery
Many scientific problems benefit from computations that are parallel at a coarse grain. Collections of looselycoupled, heterogeneous computers are increasingly being applied to these problems. While individual computers are designed to be relatively reliable, a collection of several autonomous machines necessarily has a greater rate of failure. As data networks improve, and larger multicomputer...
متن کاملScalable Fault Tolerance in Multiprocessor Systems
Evolving trends in design and use of computers are resulting in fault-prone systems which may not execute a program to completion. Checkpoint-and-recovery is commonly used to recover from faults and complete parallel programs. Conventional checkpointing-and-recovery can incur high overheads and may be inadequate in the future as faults become frequent. We propose to execute parallel programs de...
متن کاملDefining the Checkpoint Interval for Uncoordinated Checkpointing Protocols
Parallel applications running on large computers suffer from the absence of a reliable environment. Fault tolerance proposals, in general, rely on rollback-recovery strategies supported by checkpoint and/or message logging. There are well-defined models that address the optimum checkpoint interval for coordinated checkpointing. Nevertheless, there is a lack of models concerning uncoordinated ch...
متن کامل